Deep flanking sequence engineering for efficient promoter design using DeepSEED

Designing promoters with desirable properties is essential in synthetic biology. Human experts are skilled at identifying strong explicit patterns in small samples, while deep learning models excel at detecting weak implicit patterns in large datasets. Biologists have described the sequence patterns of promoters via transcription factor binding sites (TFBSs). However, the flanking sequences of cis-regulatory elements have long been overlooked and are often chosen arbitrarily in promoter design. To address this limitation, we introduce DeepSEED, an AI-aided framework that efficiently designs synthetic promoters by combining expert knowledge with deep learning techniques. DeepSEED has demonstrated success in improving the properties of constitutive and IPTG-inducible promoters in Escherichia coli and of doxycycline (Dox)-inducible promoters in mammalian cells. Furthermore, our results show that DeepSEED captures implicit features in flanking sequences, such as k-mer frequencies and DNA shape features, which are crucial for determining promoter properties.


Train/test/validation set splitting and DeepSEED performance comparison
To develop the prediction model, we partitioned the samples into three sets: 80% for training, 10% for validation, and 10% for testing. This partitioning was applied to both the E. coli and the mammalian-cell predictors. The E. coli promoter predictor achieved a Pearson correlation coefficient (PCC) of 0.74 on the E. coli promoter dataset, higher than our previous CNN-based predictive model (ref. 4; PCC = 0.65) (Supplementary Fig. 19). For the generator, which is trained in an unsupervised manner, we did not apply a specific dataset partitioning. Instead, we monitored k-mer frequencies during training. Notably, for E. coli promoters we observed a strong global-scale correlation between the k-mer frequencies (k = 4 to 6) of the natural sequences and of the DeepSEED-designed sequences, with a higher PCC than that obtained for sequences generated by our previous Wasserstein generative adversarial network with gradient penalty (WGAN-GP) model (ref. 4) (Supplementary Fig. 4d).
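As a concrete illustration of this evaluation protocol, the sketch below shows an 80/10/10 split and the PCC computation on held-out data; the function names and array arguments are illustrative and are not part of the released DeepSEED code.

```python
# Minimal sketch of the 80/10/10 train/validation/test split and the PCC metric
# described above. Names (split_dataset, predicted/measured arrays) are illustrative.
import numpy as np
from scipy.stats import pearsonr

def split_dataset(n_samples, seed=0):
    """Return index arrays for an 80/10/10 train/validation/test split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train, n_val = int(0.8 * n_samples), int(0.1 * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def evaluate_pcc(predicted_activity, measured_activity):
    """Pearson correlation between predicted and measured promoter activities."""
    pcc, _ = pearsonr(predicted_activity, measured_activity)
    return pcc
```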

DeepSEED framework for promoter design
The DeepSEED framework comprises two deep learning models (a conditional generative adversarial model and an activity prediction model) and one optimization algorithm (a genetic algorithm). The code is available on GitHub (https://github.com/WangLabTHU/DeepSEED). Briefly, the conditional generative adversarial model estimates the conditional probability distribution of functional promoters given the 'seed' inputs, so that synthetic promoters can be sampled from this conditional distribution. The promoter activity prediction model estimates the gene expression activity of promoters. The genetic algorithm combines the two models to optimize promoter activity. Promoter design proceeds in three steps: (1) the conditional generative adversarial model is trained on the 'seed' sequences and their corresponding natural promoter sequences in the training dataset (Fig. 1b); (2) the activity prediction model is trained on the natural promoter sequences and their activities in the training dataset (Fig. 1b); (3) after both models are trained, the generator of the conditional generative adversarial model and the activity prediction model are concatenated, with the output of the generator serving as the input of the activity prediction model (Fig. 1a). The genetic algorithm is then used to design synthetic promoters based on the 'seed' sequences. The synthetic sequences obey the functional-promoter distribution (learned by the conditional generative adversarial model) and show high activity (learned by the activity prediction model) (Fig. 1c).
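A minimal sketch of step (3) is given below: the trained generator and predictor are chained so that a candidate latent vector is scored by the predicted activity of the promoter it generates. The interfaces (a generator taking the concatenated latent/seed vector, a predictor taking a one-hot sequence) are assumptions for illustration and may differ from the released DeepSEED code.

```python
# Sketch of concatenating the trained generator and predictor: the generator fills
# the model-defined (flanking) positions of a 'seed', and the predictor scores the
# resulting promoter. Interfaces are illustrative, not the exact DeepSEED classes.
import torch

def make_fitness_fn(generator, predictor, seed_onehot):
    """Return a function mapping a latent vector (1, latent_dim) to predicted activity."""
    generator.eval()
    predictor.eval()

    def fitness(latent):
        with torch.no_grad():
            gen_in = torch.cat([latent, seed_onehot.flatten(1)], dim=1)
            promoter = generator(gen_in)        # synthetic promoter (one-hot encoded)
            return predictor(promoter).item()   # predicted expression activity
    return fitness
```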

Details of conditional GANs (cGANs), predictor training and genetic algorithm optimization
cGANs training:
Vanilla GANs are trained on unsupervised data, i.e., promoter sequences alone. During training, the input of the vanilla generator is a latent vector and its output is a synthetic promoter, while the vanilla discriminator classifies synthetic versus natural promoters. To integrate expert prior knowledge into promoter design, we used cGANs instead. During training, the input of the generator is the 'seed' sequence, encoded as the concatenation of the latent vector and the seed vector, and the output of the generator is the synthetic promoter (Supplementary Fig. 1, left). Here, the 'seed' vector determines the location and nucleobases of the knowledge-defined regions, and the latent vector determines the model-defined (flanking) regions. The discriminator of the cGANs receives paired inputs of 'seed' sequences and generated/natural sequences, so the 'seed' sequences deliver the expert knowledge to the discriminator (Supplementary Fig. 1, middle). In each iteration, the discriminator is trained five times for every single generator update.
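The sketch below illustrates one such training iteration, assuming a generator that takes the concatenated latent/seed vector and a discriminator that takes (seed, sequence) pairs; the simplified Wasserstein-style loss terms and module interfaces are assumptions, not the exact DeepSEED objective.

```python
# Illustrative single cGAN iteration: the generator input is the latent vector
# concatenated with the (flattened) 'seed' vector, the discriminator sees
# (seed, sequence) pairs, and it is updated five times per generator update.
# Losses shown are simplified Wasserstein-style terms, not the exact DeepSEED loss.
import torch

def cgan_iteration(generator, discriminator, g_opt, d_opt,
                   seed_onehot, real_promoters, latent_dim=128):
    batch = real_promoters.size(0)

    for _ in range(5):  # discriminator trained five times per iteration
        z = torch.randn(batch, latent_dim)
        fake = generator(torch.cat([z, seed_onehot.flatten(1)], dim=1)).detach()
        d_loss = (discriminator(seed_onehot, fake).mean()
                  - discriminator(seed_onehot, real_promoters).mean())
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    z = torch.randn(batch, latent_dim)  # generator trained once per iteration
    fake = generator(torch.cat([z, seed_onehot.flatten(1)], dim=1))
    g_loss = -discriminator(seed_onehot, fake).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```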
Predictor training: During predictor training, the natural sequences from the training set are used as input, and the mean squared error (MSE) between the predictor outputs and the ground-truth activities is used as the loss to optimize the predictor.
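A minimal training-loop sketch for this objective is shown below; the data-loader contents, optimizer, and hyperparameters are illustrative assumptions.

```python
# Sketch of predictor training: natural promoter sequences (one-hot encoded) as input,
# MSE between predicted and measured activities as the loss. Settings are illustrative.
import torch
import torch.nn as nn

def train_predictor(predictor, train_loader, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(predictor.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for onehot_seq, activity in train_loader:   # training-set sequences and activities
            opt.zero_grad()
            loss = mse(predictor(onehot_seq).squeeze(-1), activity)
            loss.backward()
            opt.step()
    return predictor
```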

Genetic algorithm optimization:
For the promoter design problem, we used the genetic algorithm to optimize the flanking sequences given the 'seed' assignment. The random vectors in the 'seed' sequences serve as the initial population, and the hybrid network formed by the generator (from the cGANs) and the predictor is used to estimate the fitness score of each candidate. The genetic algorithm then evolves this population to find the final solutions.
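The sketch below shows one way such a loop can be realized: a population of random latent vectors is scored by the generator-predictor chain (for example, the fitness function sketched above) and evolved by selection, crossover, and mutation. The population size, elite fraction, and mutation scale are illustrative choices, not the values used in the study.

```python
# Illustrative genetic-algorithm loop over latent vectors. fitness_fn maps a latent
# vector to a predicted promoter activity (e.g., a generator-predictor chain).
# All hyperparameters below are assumptions for the sketch.
import numpy as np

def genetic_optimize(fitness_fn, latent_dim=128, pop_size=64, n_iter=100,
                     elite_frac=0.25, mut_sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.standard_normal((pop_size, latent_dim))        # random initial population
    for _ in range(n_iter):
        scores = np.array([fitness_fn(z) for z in pop])      # predicted activities
        elite = pop[np.argsort(scores)[::-1][:int(elite_frac * pop_size)]]
        children = []
        while len(children) < pop_size - len(elite):
            a, b = elite[rng.integers(len(elite), size=2)]
            mask = rng.random(latent_dim) < 0.5               # uniform crossover
            child = np.where(mask, a, b) + mut_sigma * rng.standard_normal(latent_dim)
            children.append(child)
        pop = np.vstack([elite, children])
    scores = np.array([fitness_fn(z) for z in pop])
    return pop[np.argmax(scores)]                             # best latent vector found
```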

Details of IPTG-inducible promoter design in E. coli
Unlike constitutive promoter design in E. coli, the promoter sequences from Johns et al. were first annotated with -10 and -35 regions, and additional regions were then annotated according to the positions of the lacO sites relative to the -10/-35 regions (Supplementary Fig. 12). For example, consider a 165-bp natural promoter sequence with annotated -10 and -35 regions:
'GTCAATTTCAACTTAAAACTAATTTTGAAGAAGTCATTCAAAGACATGCTTTATTTAGAAAGGAACAACAATTCTGGGAAATAATATTTTTTAGTAATTTCAGACAGAATATATAGATAAGGTATAATATTAAATACAAAATTTATATAACAAGGGGGATTATTC'
Then, based on the relative positions of the 2lacO, 3lacO, and 4lacO sites, the corresponding 'seed' sequences were generated. The flanking sequences of the -10 and -35 regions and of the lacO sites were optimized by the genetic algorithm. With 100 iterations of optimization, the synthetic promoters with the best predicted property scores were chosen for experimental validation. The detailed sequences are provided in Supplementary Data 1.
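As an illustration of this annotation step, the sketch below derives a 'seed' from an annotated promoter by keeping the bases inside annotated elements (-10/-35 boxes, lacO sites) and masking all flanking positions with 'N'; the coordinates in the example call are placeholders, not the actual annotations.

```python
# Hedged sketch: build a 'seed' from an annotated promoter by retaining the bases of
# annotated elements and replacing every flanking position with 'N'.
def make_seed(promoter, annotated_spans):
    """annotated_spans: list of (start, end) half-open intervals of fixed elements."""
    seed = ['N'] * len(promoter)
    for start, end in annotated_spans:
        seed[start:end] = promoter[start:end]   # keep -10/-35 boxes, lacO sites, etc.
    return ''.join(seed)

# Example call with made-up -35 and -10 coordinates (illustrative only):
# seed = make_seed(natural_promoter_165bp, [(103, 109), (123, 129)])
```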

Details of Dox-inducible promoter design in mammalian cells
Two different datasets were prepared for cGAN training (ref. 6) and predictor training (ref. 7), respectively. The HEK293 enhancer dataset compiled by Wang et al. (ref. 6) was used to train the cGANs: 26,604 sequences of 150-bp length were trimmed from the active enhancer regions provided in the dataset. These sequences were annotated with 1,205 motifs from the JASPAR 2022 vertebrate dataset (ref. 8), generating 'seed' sequences with patterns such as:
'GTTCCCTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGGCAGGGAGAGGCTGGGCTGNNNNNNNNNNNNNNNNNNCCTGCTGGGTGTTGCCTAGGAGAGGAGAAAAAGCCCCGACGCCAGAAATGGANNNNNNNNNNNNACCTCGGAGGG'
where 'GTTCCCTG', 'AGGCAGGGAGAGGCTGGGCTG', 'CCTGCTGGGTGTTGCCTAGGAGAGGAGAAAAAGCCCCGACGCCAGAAATGGA' and 'ACCTCGGAGGG' represent motif sequences, and 'N' represents the flanking regions of the motifs. These annotated 'seed' sequences and their enhancer sequences were used to train the cGANs. To train the predictor, 328,795 sequences and their transcriptional activities provided by Ernst et al. (ref. 7) were prepared. After training the cGANs and the predictor, a 3tetO 'seed' sequence defined by expert knowledge was prepared as follows:
'NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTCCCTATCAGTGATAGAGANNNNNNNNNNNNNNNNTCCCTATCAGTGATAGAGANNNNNNNNNNNNNNNNNTCCCTATCAGTGATAGAGANNNNNNNNNNN'
(TCCCTATCAGTGATAGAGA: tetO motif). With 100 iterations of optimization, the synthetic promoters with the best predicted property scores were chosen for experimental validation. The detailed sequences are provided in Supplementary Data 1.
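For illustration, the sketch below assembles such a 3tetO 'seed' from scratch by placing the tetO motif at fixed offsets within an otherwise all-'N' 150-bp string; the offsets are read off the pattern above and may not match the exact design coordinates.

```python
# Hedged sketch of assembling the 3tetO 'seed': tetO motifs at chosen offsets,
# all remaining positions left as 'N' for the generator to fill. Offsets are illustrative.
TETO = "TCCCTATCAGTGATAGAGA"

def build_tet_seed(length=150, motif=TETO, offsets=(49, 84, 120)):
    seed = ['N'] * length
    for off in offsets:
        seed[off:off + len(motif)] = motif      # place one tetO copy
    return ''.join(seed)
```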

Details of constitutive promoter design in E. coli
A total of 29,248 E. coli promoter sequences of 165-bp length were collected from Johns et al., with their activities measured by RNA-seq (ref. 5). Promoter sequences were annotated with -10/-35 motifs, generating 'seed' sequences with the following pattern:
'NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTGACANNNNNNNNNNNNNNTATAATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN'
11,601 sequences with distinct -10 and -35 regions were chosen from the dataset. Their promoter sequences, 'seed' sequences, and activities were used to train the cGANs and the predictor. For E. coli promoter design, we selected three template promoters from the iGEM parts registry (BBa_J23119, BBa_J23118, and BBa_J23114; http://parts.igem.org/Promoters/Catalog/Constitutive), annotated their -10 and -35 regions, and generated the corresponding 'seed' sequences:
BBa_J23119 'seed' sequence, 165 bp (random sequences were used to extend the initial promoter sequence to 165 bp);
BBa_J23118 'seed' sequence, 165 bp (random sequences were used to extend the initial promoter sequence to 165 bp);
BBa_J23114 'seed' sequence, 165 bp (random sequences were used to extend the initial promoter sequence to 165 bp).
The flanking sequences of the -10 and -35 regions were optimized by the genetic algorithm. With 60 iterations of optimization, the synthetic promoters with the best predicted property scores were chosen for experimental validation. The detailed sequences are provided in Supplementary Data 1.
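The sketch below illustrates extending a short template promoter to the 165-bp design length with random bases; how the padding is split between the upstream and downstream flanks is an assumption here, not taken from the paper.

```python
# Hedged sketch: pad a short template (e.g., an iGEM J231xx promoter) with random
# bases up to 165 bp. The split between upstream and downstream padding is illustrative.
import random

def extend_to_length(template, target_len=165, upstream_frac=0.5, seed=0):
    rng = random.Random(seed)

    def rand_bases(n):
        return ''.join(rng.choice('ACGT') for _ in range(n))

    pad = target_len - len(template)
    left = int(pad * upstream_frac)
    return rand_bases(left) + template + rand_bases(pad - left)
```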